Comparison of Time-Course Short and Long Read RNA-seq Data

Author

Anna Toidze, Marie-Claire Indilewitsch, Michel Tarnow

Published

January 12, 2024

Introduction

The read throughput of long-read RNA-sequencing methods, for example the Pacific Biosciences (PacBio) platform, are a subject to improvement especially in connection to RNA isoform identification and quantification (Hardwick et al, 2019). This is particularly of interest when looking at alternative splicing which is a regulatory process that influences coding sequences, as well as translation efficiency, stability and other genetic processes by the differential splicing of exons during transcript maturation. Thereby it is a factor in the development of several diseases (Baralle et al, 2017). Correctly identifying alternative transcript isoforms is challenging as short-read technologies like Illumina have too short read lengths (50-600 bp, necessary would be > 5 kB) to correctly identify the splice sites even though they have a high throughput. Long-read technologies like PacBio or Oxford Nanopore enable sequencing with the necessary read lengths for RNA isoform identification and quantification, however, these platforms cannot attain the necessary read depths since they suffer from low read depths/throughput (smaller than the necessary 2 * 10^7) (Al’Khafaji et al, 2023).

A newly introduced method for long-read sequencing of transcript isoforms based on ISO-seq is multiplexed arrays isoform sequencing (MAS-ISO-seq), which was introduced by Al’Khafaji et al (2023). MAS-ISO-seq relies on the concatenation of cDNA fragments into longer sequence libraries with a narrow length distribution, and thereby exploits the full capability of the Ile Sequel platform compared to other protocols. The workflow is depicted in figure 1. In short, the cDNA fragments are distributed over multiple PCR’s, each of which appends specific barcode adapters containing deoxy-uracil (dU). Following, the amplified samples are pooled again and undergo dU digestion. Finally, the MAS-ISO-seq workflow creates concatenated cDNA fragments by barcode-directed adapter ligation of the original cDNA fragments. These concatenated fragments can then be used for PacBio sequencing (Al’Khafaji et al, 2023).

Figure 1: MAS-ISO-seq workflow. MAS-ISO-seq relies on the concatenation of cDNA fragments into longer sequence libraries to fully use the capacities of the PacBio Sequel Ile platform. Figure from Al’Khafaji et al (2023).

The performance of the MAS-ISO-seq protocol was tested by Al’Khafaji et al (2023) using different sample sets. In summary, the authors could show that RNA isoforms could be identified much more confidently with the MAS-ISO-seq protocol than with short-read methods when applied to both a SIRV sample set and a single cell sample set. Moreover, error robustness and >15-fold increase in throughput of the PacBio Sequel Ile platform could be shown (Al’Khafaji et al, 2023).

Here, we performed a similar analysis comparing two different count data sets, one derived from the MAS-ISO-seq protocol (long-read sequencing) and one derived from Illumina sequencing (short-read sequencing). The data sets contain samples from WTC-11 cells (induced pluripotent stem cells) that underwent a directed differentiation from pluripotent stem cells to hemogenic endothelial cells. More details can be found here: directed-differentiation-hemogenic-endothelial-cells-from-human. For six days, 1 – 3 samples were prepared per day and sequenced using both the PacBio and Illumina platforms. The data was subsequently quantified with Bambu (Chen et al, 2023) in the case of the PacBio sequence data and with Salmon (Patro et al, 2017) in the case of Illumina sequence data. The quantification was performed using 4 different novel discovery rate (NDR) thresholds (NDR thresholds of 0.025, 0.05, 0.1 and 0.2 were applied here). The obtained data sets were finally compared in an exploratory data analysis (EDA), a differential gene expression (DGE) analysis and a differential transcript usage (DTU) analysis to explore the performance of the MAS-ISO-seq protocol compared to Illumina sequencing, which has been used and thus underwent optimization for much longer.

Methods

The data analysis steps were performed in RStudio (Posit team (2023), version 2023.6.2.561, R version 4.3.1). In this report file, not all code is shown, however, it can be found in the respective .qmd file (see report.qmd) or in other code files specified in the different methods sections below (see code/). An analysis of spiked-in control sequences can be found at code/qmd/ctrl_seq.qmd.

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

First, the data was imported and preprocessed (see code/qmd/import_processing.qmd). Hereby, the control sequences were filtered out and counts and transcripts-per-million (TPM) saved separately for Illumina+Salmon (short read) data and PacBio+Bambu (long read data). The 16 different data frames (8 for TPM, 8 for counts) were sorted to have the same order and saved in .rds files (bambu, salmon and metadata) which were then used for further analysis.

Exploratory Analysis

The exploratory analysis was divided into (1) a direct comparison of the differences in the counts/TPM between Illumina+Salmon and PacBio+Bambu for all 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), and their visualization, (2) the computation of correlations between the different samples of Illumina+Salmon and PacBio+Bambu stratified by transcript length and (3) MDS plots. The whole analysis was performed for count data as well as TPM data but it was decided to keep the TPM in the report file as it is more representative when performing the stratification by length (further explained in discussion). The count plots as well as all other TPM plots not shown in the following can be found in code/qmd/comparison_pacbio_illumina.qmd.

Differences Illumina+Salmon vs. PacBio+Bambu

First, the direct differences between the Illumina+Salmon and PacBio+Bambu data frames were calculated and visualized using histograms to see how different the raw data is. The histograms were only plotted for the first column of the data frame (sample 1) as plotting all columns would not lead to an increase in information.

Correlation Illumina+Salmon vs. PacBio+Bambu

Secondly, the correlation between Illumina+Salmon and PacBio+Bambu was investigated at first not stratified by length and subsequently stratified by length on the transcript level (TPM) for all 4 NDR thresholds. Hereby, the samples were divided into 10 bins always portraying an interval of transcript length of 1000 (0-1000, 1000-2000 and so on), meaning that the bins had different sizes. Both Pearson and Spearman correlation between all samples were computed and visualized in heat maps, however, in the results only the Spearman plots are shown (explained in the discussion). For all 10 bins the correlations were summed up and shown in a scatter plot to see the effect of transcript length on the correlation for all 4 NDR thresholds.

MDS Plots

Lastly, the MDS plots were plotted for all 8 data frames (here the count data frames were used) to compare the sample distribution between Illumina+Salmon and PacBio+Bambu.

Differential Gene Expression and Gene Set Analyses

Differential Gene Expression Analysis

The aim of the Differential Gene Expression (DGE) analysis was twofold: first, the aim was to identify genes that are differentially expressed (DE) over the course of time, so over the course of the 5 days at which samples have been generated; second, the aim was to compare the results of the DGE analysis of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data). To do so, in total 8 different count data frames were analyzed. For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), there are two different count data frames, one derived from Illumina+Salmon and one derived from PacBio+Bambu. For the analysis of DE genes, a standard pipeline from the R package edgeR (Robinson et al, 2010, R package version 3.42.4) was used.

Before performing the DGE analysis itself, some genes of interest were investigated. These genes are targeted by the directed differentiation treatment the cells underwent, and therefore might show changes in their expression over time. Here, the following genes were investigated:

  • CTNNB1 (catenin beta 1)
  • ETV2 (ETS variant transcription factor 2)

Afterwards, the DGE analysis followed. The most important steps of the analysis included filtering of genes that do not have a sufficient count in a worthwhile number of samples, calculation of scaling factors (for library sizes) and posterior dispersion estimates, fitting of a negative binomial generalized log-linear model to each gene and performing a likelihood ratio test. To do so, a design matrix was created using the following formula: ~ day. Day specifies the day at which each sample was created, so days 0 to 5. Thus, the design matrix was created in a way that the first time point, day 0, is the baseline (intercept of the model). In the likelihood ratio tests, the full model was compared to a reduced model which has values of zero for all coefficients but the baseline, so for days 1 to 5. For each gene, p-values and other test statistics of the likelihood ratio test were computed and used afterwards to identify genes that showed statistical significant differential expression. Adjusted p-values were computed by the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and the DE genes were filtered by a significance threshold of 0.01.

To compare the results for Illumina+Salmon with the results for PacBio+Bambu, an UpSetR (Gehlenborg, 2019, R package version 1.4.0) plot was created for each NDR threshold. This plot overlayed the genes called DE for the two different cases. Moreover, the p-values and logFC values (logFC comparing day 0 and day 5) for each gene were compared using simple scatter plots.

Grouping of Differentially Expressed Genes

After the DGE analysis, the genes identified as DE were grouped into gene sets. For this grouping, only the genes that were identified as DE by the analysis of both the Illumina+Salmon and the PacBio+Bambu data of the 0.2 NDR threshold were considered. Moreover, the genes were grouped depending on their Illumina+Salmon counts since these counts were treated as a “ground truth” to which the PacBio+Bambu counts were compared.

Of these genes, the average expression at each time point was calculated first, so the average of the samples of each time point. Then, all gene-gene Spearman-correlations were computed and 1-correlation was used as the distance between two genes during hierarchical clustering. The results of the hierarchical clustering were then evaluated to group the genes into clusters. From visual inspection of a cluster dendrogram and a correlation heat map, the number of clusters was set manually. Lastly, the average expression of the genes of the clusters at each time point was calculated for each cluster. The clusters were then used as gene sets for the following gene set analysis.

Gene Set Analysis

To make sense of the genes that were called DE, they were first grouped into gene sets of similar genes (see above). These gene sets were then used as input for a gene set analysis. Here, the frequency-based method DAVID (Sherman et al, 2022, https://david.ncifcrf.gov/home.jsp) was employed.

The Database for Annotation, Visualization and Integrated Discovery (DAVID) tool is a web tool for functional annotation and enrichment analyses of gene sets, e.g. to find enriched gene ontology terms. To do so, the web tool requires the user to upload not only a list of genes, so the gene set of interest, but also another list of genes, the background. The tool then searches for functional annotations that are statistically overrepresented (enriched) in the gene set of interest compared to the background. Therefore, the choice of background can influence the results of the analysis. Here, the gene sets of interest were the clusters created during the grouping of the DE genes and as background all genes that remained after the filtering were used. The only exception to that are genes that were newly discovered during the transcript quantification since these genes do not have a valid ensemble gene id which is a requirement of DAVID.

Differential Transcript Usage Analysis

Differential Transcript Usage (DTU) was performed to identify gene expression alterations and isoform switching (Marques-Coelho et al, 2021). DTU complements DGE as it enables the identification of alternative splicing and isoform switches even if the gene expression does not change between conditions (Marques-Coelho et al, 2021; Love et al, 2018). The aim was to identify proportional differences in isoform composition of a gene in samples generated over course of the 5 days and to compare the results of the DTU analysis of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data).

The available workflows (Love et al, 2018; Lin, 2021) were tested on the given data set (see code/qmd/DTU_test_analysis.qmd). The workflow made use of 3 libraries vignettes: DRIMSeq, DEXSeq and stageR. DEXSeq assumes that the feature (transcript) counts follow a negative binomial (NB) distribution and considers them as relative proportions of the group (the gene) using an interaction term in a generalized linear model (GLM) (Love et al, 2018; Anders et al, 2012). DRIMSeq assumes that feature proportions follow the Dirichlet distribution (an Dirichlet Multinomial model (MD) for each gene) where the total count for the gene is considered fixed (Love et al, 2018; Nowicka et al, 2016). The filtering was done with DRIMSeq package with both the minimal number of samples where the genes should be expressed and the the minimal number of samples where the features should be expressed set to 3 (the number of replicates on day 0 and 5). Additional filters were also added: it was required for a transcript to (1) have a count of at least 10 in at least 3 samples, (2) have a relative abundance proportion of at least 0.1 in at least 3 samples, and (3) have a total count of 10 in the corresponding gene in all 3 samples. The filtered DRIMSeq object was used for the DEXSeq workflow, as the filtering was easier, and the analysis was sped up. The design formula was taken as ~sample + exon + day:exon (same as for DGE analysis). DEXSeq analysis included the estimation of size factors, estimation of dispersion, and likelihood ratio tests for differential exon usage. The results tables (log2 fold changes and p-values) were generated and per-gene adjusted p-values were computed. The stageR package was used for post-processing of the calculated p-values through stage-wise analysis of the data, by screening first at the gene level for evidence of DTU (screening stage) and confirming which transcripts within those significant genes show evidence of DTU (confirmation stage). stageR analysis was performed with an overall false discovery rate (OFDR, alpha parameter) equal to 0.05. A custom function plotExpression was created to plot the transcript expression of a significant gene (Lin, 2021). UpSetR plots and Venn diagrams were used to visualize the overlap between the transcripts identified by the two methods.

Results

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

Load data

The data is loaded from RDS-object files (bambu, salmon and metadata, see code/qmd/import_processing.qmd). For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), two data frames exist containing the expression data (counts) as well as the TPMs of the two different technologies used to obtain the data:

1. Illumina sequencing and Salmon quantification (short read)
2. PacBio sequencing and Bambu quantification (long read)

In total, this adds up to 16 different data frames. Each of these data frames contains 11 samples, which are replicates from one of five different time points. Each time point has 1-3 replicates.

Differences

The differences between Illumina+Salmon and PacBio+Bambu have been computed in a direct comparison and it was calculated how many percent had a difference smaller than 50 to get an initial overview of the raw differences.

[1] "Differences transcript level (0.025, 0.05, 0.1, 0.2):"
[1] 0.9232244
[1] 0.9232975
[1] 0.9234513
[1] 0.9238523
[1] "Differences gene level (0.025, 0.05, 0.1, 0.2):"
[1] 0.8487736
[1] 0.8491386
[1] 0.8496814
[1] 0.851743

On the transcript level, the differences were smaller than 50 for around 92% of the transcripts for all NDR thresholds, and on the gene level for around 85% for all NDR thresholds.

Histograms of Differences

The differences were then visualized for the first column of the data frames. Here, only the plots for the 0.025 and 0.05 thresholds on transcript and gene level are shown because all 8 histograms look very similar. The other histograms can be seen in code/qmd/comparison_pacbio_illumina.qmd.

As one can see, most differences are around 0 (highest bar) with small bars around it. Thus, when just looking at the raw data the difference between Illumina+Salmon and PacBio+Bambu seems to be small.

Correlation and Stratification by Length

Next, the correlation between Illumina+Salmon and PacBio+Bambu was computed. For that, first the transcript lengths were imported, and second, their distribution was investigated to be able to divide them into sections for the stratification by transcript length.

As shown in the plot above, most transcripts have a length between 0 and around 2000 bp, and at around 10000 bp almost no transcripts can be found. Thus, the transcript lengths were divided into intervals of 1000 bp starting from 0 bp up to 10000 bp. Here it is of note that the intervals do not contain the same amount of transcripts, so the bins have different sizes.

First, the correlation not stratified by length was plotted. Both Pearson- and Spearman-correlation was computed, but only the Spearman plot of the first NDR threshold is shown for transcript and gene level (explained in the discussion).

In the spearman heat maps, there is a lower correlation on the gene level compared to the transcript level. Moreover, the day 2 and day 3 Bambu samples seem to be highly correlated also with the day 4 and day 5 Salmon samples.

Second, the correlation stratified by length was plotted for all 4 NDR thresholds. Due to no further information gain, only one Pearson plot and representatively one Spearman heat map (for one bin, we chose 1000-2000 as its the most abundant transcript length) for each NDR threshold as well as all 4 scatter plots of the summed up correlation stratified by length are shown here. The individual heat maps are not shown in the report but can be commented out in the .qmd file (see report.qmd) for their visualization.

The Pearson plots do not look as expected since day2-rep1-Bambu is highly correlated with all Salmon samples, and the rest of the correlations seem to be quite low. When observing the 0.025 heat map with Spearman correlation one can see most of the samples being highly correlated with themselves (or with their replicates), and clearer the highly correlated diagonal usually expected in these heat maps.

The remaining heat maps for the other thresholds (0.05, 0.1, 0.2) not shown here look very similar to the 0.025 heat map. Looking at the scatter plots of the summed up correlations stratified by transcript lengths, they also look almost the same comparing the different thresholds. One can observe the clear trend that with a increasing transcript length the correlation is also increases, and that for very small transcript lengths (<1000 bp) the correlation is very low.

MDS Plots

The MDS plots look very similar comparing all 4 NDR thresholds. When comparing the MDS plots between Salmon and Bambu the MDS1 separation by day is almost the same while the MDS2 separation shows more differences. In Bambu the replicates are grouped slightly closer together than in Salmon, and fewer samples are above 0.

Differential Gene Expression and Gene Set Analyses

Since the DGE analysis of the different NDR thresholds showed very similar results, only the 0.2 NDR threshold is shown here. The analysis of all 4 NDR thresholds can be found in code/html/dge_analysis.html and in code/qmd/dge_analysis.qmd.

Load data

The data is loaded from RDS-objects that have been created with this file: code/qmd/import_processing.qmd. For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), two data frames containing the expression data (counts) of the two different technologies were loaded for the DGE analysis:

1. Illumina sequencing and Salmon quantification (short read)
2. PacBio sequencing and Bambu quantification (long read)

In total, this adds up to 8 different data frames. Each of these data frames contains 11 samples, which are replicates from one of six different time points. Each time point has 1-3 replicates. As stated above, in this file only the results of the 0.2 NDR threshold data is shown. For the DGE analysis, the counts were loaded. For the investigation of genes of interest, the TPMs were loaded.

   sampleID condition time
1         1 day0.rep1    0
2         2 day0.rep2    0
3         3 day0.rep3    0
4         4 day1.rep1    1
5         5 day2.rep1    2
6         6 day3.rep1    3
7         7 day3.rep2    3
8         8 day4.rep1    4
9         9 day5.rep1    5
10       10 day5.rep2    5
11       11 day5.rep3    5

Genes of Interest

To investigate expression changes of the genes of interest, their average expression at each time point was computed and then plotted.

# calculate average expression per time point for Salmon and Bambu

## Salmon
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr_salmon <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
  rowMeans(salmon_0.2_tpm_gene[, which(meta_samples$time == time)])
}))

# for time points 1, 2 and 4, only one replicate exists
avg_expr_salmon$V4 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 1)]))
avg_expr_salmon$V5 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 2)]))
avg_expr_salmon$V6 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 4)]))

avg_expr_salmon <- avg_expr_salmon[, c(1,4,5,2,6,3)]

colnames(avg_expr_salmon) <- sort(unique(meta_samples$time))

## Bambu
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr_bambu <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
  rowMeans(bambu_0.2_tpm_gene[, which(meta_samples$time == time)])
}))

# for time points 1, 2 and 4, only one replicate exists
avg_expr_bambu$V4 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 1)]))
avg_expr_bambu$V5 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 2)]))
avg_expr_bambu$V6 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 4)]))

avg_expr_bambu <- avg_expr_bambu[, c(1,4,5,2,6,3)]

colnames(avg_expr_bambu) <- sort(unique(meta_samples$time))

The first gene investigated was CTNNB1 (catenin beta 1). This gene was targeted by the treatment on days 0 and 1. Here, the addition of GSKi activated Wnt-signaling which in turn activated B-catenin providing transcriptional regulation. A trend is visible showing an increase in expression in the first days, and then the expression reaches a plateau for the last days, agreeing with the addition of GSKi on days 0 and 1. On days 3 - 5, the treatment continues with the addition of BMP4 and VEGF. BMP4 induces the transition of mesodermal cells into endothelial lineages through usage of upstream Indian-Hedgehog signaling and downstream Smad signaling. Moreover, the gene ETV2 (ETS variant transcription factor 2) is activated due to the addition of VEGF, which is added starting from day 3 as described above. Here, almost no expression can be detected before day 3, followed by a major increase on day 3 and decreasing expression on days 4 and 5.

When comparing the expression levels of the Illumina+Salmon data and of the PacBio+Bambu data, the general trend of the expression levels seem to be similar. Furthermore, the expression levels are in agreement with the treatment protocol.

Create DGEList Objects and Filter Expressed Genes

The expression data was then filtered using the function filterByExpr to exclude genes that did not show high enough expression or did not show expression in enough samples.

day <- factor(meta_samples$time)
y_salmon_0.2 <- DGEList(counts = salmon_0.2_gene, group = day)
y_bambu_0.2 <- DGEList(counts = bambu_0.2_gene, group = day)

expressed_salmon <- filterByExpr(y_salmon_0.2)
expressed_bambu <- filterByExpr(y_bambu_0.2)

not_expressed <-
  intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))

if (length(not_expressed) > 0) {
  y_salmon_0.2 <-
    y_salmon_0.2[-which(rownames(salmon_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
  y_bambu_0.2 <-
    y_bambu_0.2[-which(rownames(bambu_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}

Differential Gene Expression Analysis

The DGE analysis aimed to identify genes that are DE over the time course. These results were computed here only for the 0.2 NDR threshold of both technologies, and were then overlayed to compare the two technologies, so the long read and short read sequencing.

The design matrix was created in a way that the first time point, day0, was used as a baseline (intercept of the model).

# create design matrix
design <- model.matrix( ~ day)
design
   (Intercept) day1 day2 day3 day4 day5
1            1    0    0    0    0    0
2            1    0    0    0    0    0
3            1    0    0    0    0    0
4            1    1    0    0    0    0
5            1    0    1    0    0    0
6            1    0    0    1    0    0
7            1    0    0    1    0    0
8            1    0    0    0    1    0
9            1    0    0    0    0    1
10           1    0    0    0    0    1
11           1    0    0    0    0    1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$day
[1] "contr.treatment"

The DGE analysis was performed using edgeR, and genes were considered as DE if the adjusted p-value of the likelihood ratio test between the full and the reduced model was below 0.01. The p-value adjustment was performed using the Benjamini-Hochberg method.

# Salmon 0.2

y_salmon_0.2 <- calcNormFactors(y_salmon_0.2)
## estimate dispersion
y_salmon_0.2 <- estimateDisp(y_salmon_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
## save lrt for pathway analysis
lrt_pathway <- lrtS
## multiple testing correction
decide_dif_s0.2 <-
  decideTests.DGELRT(lrtS,
                     adjust.method = "BH",
                     p.value = 0.01)
summary(decide_dif_s0.2)
       day5-day4-day3-day2-day1
NotSig                    18289
Sig                        7723
# Bambu 0.2

y_bambu_0.2 <- calcNormFactors(y_bambu_0.2)
## estimate dispersion
y_bambu_0.2 <- estimateDisp(y_bambu_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
## multiple testing correction
decide_dif_b0.2 <-
  decideTests.DGELRT(lrtB,
                     adjust.method = "BH",
                     p.value = 0.01)
summary(decide_dif_b0.2)
       day5-day4-day3-day2-day1
NotSig                    17667
Sig                        8345

For Illumina+Salmon, out of the 26012 tested genes, 7723 were identified as DE. In the case of PacBio+Bambu, 8345 genes were identified as DE. Both methods identified roughly the same amount of genes as DE.

From the UpSet plot shown above it is visible that, even though identifying roughly the same number of genes as DE, the two technologies show some differences in their results. The majority of genes identified overlap, however, there are still some genes only called DE by either Illumina+Salmon or by PacBio+Bambu. To further investigate the similarities/differences, the logFC and p-values of the analysis were compared. The p-values were calculated for the likelihood ratio test, the logFC values, however, were calculated for the discrete comparisons of the days 1-5 with day 0. Here, the logFC values of day 5 were compared.

The comparison of the p-values after adjustment shows a similar trend in both technologies, but also visible differences. The same is observable for the logFC values. Additionally, it is visible that many genes that show a logFC value of zero in the analysis of one technology show a (high) logFC value in the other one. This agrees with the observation made in the UpSet plot, namely that some genes are called DE by either of the two technologies, but not by both.

The DGE analysis mainly showed two things. First, it showed that the results of the different NDR thresholds were very similar (not shown here). Second, it showed that the results of the two different technologies, Illumina+Salmon and PacBio+Bambu, were quite similar, but still showed some differences. When treating the Illumina+Salmon data as the ground truth, it can be said that PacBio+Bambu is close to the truth, but still shows deviations.

Following the DGE analysis, the identified genes were grouped into gene sets of similar genes and a gene set analysis was conducted using these gene sets. To do so, only the genes that were detected by the analyses of both technologies were used, so the overlap which can be seen in the UpSet plot. Since the results of the different NDR thresholds were very similar and thus contained mostly the same genes, only one of the thresholds, namely the NDR threshold of 0.2, was used.

Grouping of Differentially Expressed Genes

To group the DE genes, the average expression of each gene at each time point was calculated first (for more details, see calculation in Results - Differential Gene Expression and Gene Set Analyses - Genes of Interest).

The genes were then clustered using the their correlation as a distance measure (to be more precise: 1-correlation). A heat map of the correlations including a cluster dendrogram of the hierarchical clustering are shown below.

corr_DEG <- cor(t(avg_expr[DEG,]), method = "spearman")
hcl_DEG <- hclust(as.dist(1 - corr_DEG), method = "complete")
cl_DEG <- cutree(hcl_DEG, k = 4)
heatmap.2(corr_DEG, Rowv = as.dendrogram(hcl_DEG), Colv = as.dendrogram(hcl_DEG),
          trace = "none", scale = "none", labRow = NA, labCol = NA, col = viridis,
          ColSideColors = rainbow(15)[cl_DEG])

The genes were then split into 4 clusters by visual inspection, which can be seen in the plot above. The bar indicates the 4 clusters by different colors (yellow, red, green and orange). The average expression of all genes of each cluster at each time point is shown below in box plots:

The expression pattern of cluster 1 seems to follow a downwards trend, meaning that cluster 1 might contain genes that are mostly down-regulated over the time course. Clusters 2 and 3 show an upwards trend in expression, so these clusters might contain many genes that are up-regulated over the time course. Cluster 4 does not follow an apparent trend, there are many outliers visible at all time points and the vertical position of the boxes seem to fluctuate rather than following a trend. Moreover, this cluster is also the smallest one. For the following gene set analysis, the 4 clusters were used as gene sets.

Gene Set Analysis

The gene set analysis was performed using the tool DAVID, which is a frequency-based method and computes which gene ontology terms are significantly enriched in a gene set compared to a background.

To perform DAVID on the 4 gene sets, the ensemble gene ids of the genes were extracted and then written into text files, which were then uploaded to the DAVID web tool. Since some of the newly discovered transcripts could not be mapped to an existing gene, there were some genes in the gene sets that did not have an ensemble gene id compatible with DAVID. These genes were therefore removed from the analysis. As explained in the methods part, the background list contained all genes of the 0.2 NDR threshold data after filtering, with the exception of newly identified genes.

The results were then downloaded from the DAVID web tool as text files and contain terms describing the functions or pathways different genes are involved in. The results also contain other measures, like the count of genes of the input gene set that are associated with a specific term, as well as adjusted p-values showing the significance of this term in the input gene set. The results were then filtered by the adjusted p-values. Following, the results were sorted by their adjusted p-values in ascending order. Below, the top 15 most enriched terms are shown for each cluster:

# get significant terms
cluster1_sig <- cluster1 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster1_terms <- cluster1_sig$Term
print(cluster1_terms[1:15])
 [1] "KW-0770~Synapse"                                                  
 [2] "KW-0628~Postsynaptic cell membrane"                               
 [3] "KW-1003~Cell membrane"                                            
 [4] "hsa04727:GABAergic synapse"                                       
 [5] "hsa04723:Retrograde endocannabinoid signaling"                    
 [6] "KW-0325~Glycoprotein"                                             
 [7] "IPR018000:Neurotransmitter-gated ion-channel, conserved site"     
 [8] "IPR006029:Neurotransmitter-gated ion-channel transmembrane domain"
 [9] "IPR006202:Neurotransmitter-gated ion-channel ligand-binding"      
[10] "IPR006201:Neurotransmitter-gated ion-channel"                     
[11] "GO:0030594~neurotransmitter receptor activity"                    
[12] "hsa05033:Nicotine addiction"                                      
[13] "hsa05032:Morphine addiction"                                      
[14] "GO:0045202~synapse"                                               
[15] "KW-1015~Disulfide bond"                                           
cluster2_sig <- cluster2 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster2_terms <- cluster2_sig$Term
print(cluster2_terms[1:15])
 [1] "KW-0325~Glycoprotein"                    
 [2] "KW-1015~Disulfide bond"                  
 [3] "CARBOHYD:N-linked (GlcNAc...) asparagine"
 [4] "KW-0732~Signal"                          
 [5] "KW-0964~Secreted"                        
 [6] "GO:0005615~extracellular space"          
 [7] "KW-0217~Developmental protein"           
 [8] "KW-9996~Developmental protein"           
 [9] "GO:0005886~plasma membrane"              
[10] "GO:0005576~extracellular region"         
[11] "KW-0371~Homeobox"                        
[12] "IPR020479:Homeodomain, metazoa"          
[13] "IPR001356:Homeodomain"                   
[14] "IPR017970:Homeobox, conserved site"      
[15] "SM00389:HOX"                             
cluster3_sig <- cluster3 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster3_terms <- cluster3_sig$Term
print(cluster3_terms[1:15])
 [1] "GO:0005886~plasma membrane"               
 [2] "KW-0325~Glycoprotein"                     
 [3] "KW-1015~Disulfide bond"                   
 [4] "CARBOHYD:N-linked (GlcNAc...) asparagine" 
 [5] "KW-0472~Membrane"                         
 [6] "KW-1003~Cell membrane"                    
 [7] "KW-0732~Signal"                           
 [8] "TOPO_DOM:Cytoplasmic"                     
 [9] "KW-0130~Cell adhesion"                    
[10] "KW-0964~Secreted"                         
[11] "TOPO_DOM:Extracellular"                   
[12] "GO:0009986~cell surface"                  
[13] "KW-0037~Angiogenesis"                     
[14] "GO:0016021~integral component of membrane"
[15] "GO:0070062~extracellular exosome"         
cluster4_sig <- cluster4 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster4_terms <- cluster4_sig$Term
print(cluster4_terms[1:15])
 [1] "GO:0005730~nucleolus"                                                         
 [2] "GO:0003723~RNA binding"                                                       
 [3] "GO:0005654~nucleoplasm"                                                       
 [4] "KW-0539~Nucleus"                                                              
 [5] "GO:0005694~chromosome"                                                        
 [6] "GO:0006364~rRNA processing"                                                   
 [7] "KW-0597~Phosphoprotein"                                                       
 [8] "KW-0007~Acetylation"                                                          
 [9] "KW-0698~rRNA processing"                                                      
[10] "KW-0690~Ribosome biogenesis"                                                  
[11] "KW-0235~DNA replication"                                                      
[12] "GO:0032040~small-subunit processome"                                          
[13] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2)"
[14] "KW-0067~ATP-binding"                                                          
[15] "GO:0006260~DNA replication"                                                   

The significant results for gene set 1 include associations with neurological functions and pathways, for instance signalling at the neurological synapse. In total, 30 terms were identified as significantly enriched. For gene set 2, many significantly enriched terms were found (in total 250), which might be due to the large size of this gene set. The most significant terms include for instance Glycoprotein, Developmental Protein, Homeobox/domain and Extracellular Matrix. The results of gene set 3 include associations with the cell surface and the cell membrane as well as cellular signalling and cell adhesion. Here, 48 terms were found to be significantly enriched. Lastly, the results of gene sets 4 again include more terms then the ones of gene sets 1 and 3, but less then gene set 2 (in total 130). This seems to be surprising when just looking at the sizes of the genes sets, since gene set 4 is with 680 genes by far the smallest gene set. The significant terms of gene set 4 include functions related to the nucleus, DNA replication, transcription and translation.

When looking at all functions or pathways that DAVID found an association with, it seems like many different cellular components are involved in the changes over the course of time. Especially for the larger clusters, many different gene ontology terms were significant. From this analysis, it is difficult to make assumptions about specific pathways or cellular functions that are up- or down-regulated over the course of time. However, it can be said that the treatment of the cells with differentiation factors possibly effects many components of the cells.

Differential Transcript Usage Analysis

Making a Function for the Pipeline Using DEXSeq x DRIMSeq

plotExpression function adapted from Lin (2021).

For detailed explanations of the steps refer to this file: code/qmd/DTU_test_analysis.qmd.

# `samps` dataframe relates the sample identifiers to the conditions (in this case, day of the experiment)
samps <- df_list_meta$metadata
samps <- samps %>% column_to_rownames(var="sampleID")
colnames(samps) <- c("sample_id", "day")
samps$day <- factor(samps$day)

DTU <- function(data, txdf, sample_name, n = 11, n.small = 11){
  # Counts
  cts <- data
  colnames(cts) <- samps$sample_id
  # `counts` with the gene ID, the feature (transcript) ID, and then columns for each of the samples is built for dmDSdata object
  counts <- data.frame(gene_id=txdf$gene_id,
                     feature_id=rownames(txdf),
                     cts)
  # Creating dmDSdata object
  d <- dmDSdata(counts=counts, samples=samps)
   # Filtering
  d <- dmFilter(d,
                min_samps_feature_expr=n.small, min_feature_expr=10,
                min_samps_feature_prop=n.small, min_feature_prop=0.1,
                min_samps_gene_expr=n, min_gene_expr=10)
  # Visualization of filtered dataframe
  print(plotData(d))
  # The count of isoforms of the filtered genes
  # print(table(table(counts(d)$gene_id)))
  # DEXSeq pipeline
  sample.data <- DRIMSeq::samples(d)
  count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
  dxd <- DEXSeqDataSet(countData=count.data,
                       sampleData=sample.data,
                       design=~sample + exon + day:exon,
                       featureID=counts(d)$feature_id,
                       groupID=counts(d)$gene_id)
  dxd <- estimateSizeFactors(dxd)
  dxd <- estimateDispersions(dxd, quiet=TRUE)
  dxd <- testForDEU(dxd, reducedModel=~sample + exon)
  dxr <- DEXSeqResults(dxd, independentFiltering=FALSE)
  qval <- perGeneQValue(dxr)
  # Per gene
  dxr.g <- data.frame(gene=names(qval),qval)
  # Per transcript
  dxr.t = as.data.frame(dxr[, c("featureID","groupID","pvalue")])
  # Number of identified genes showing evidence for DTU
  print(paste0("Number of identified genes showing evidence for DTU through DEXSeq: ", nrow(dxr.g[dxr.g$qval < 0.05,])))
  # Number of transcripts involved
  print(paste0("Number of transcripts involved through DEXSeq: ", nrow(dxr[dxr$padj < 0.05,])))

  # stageR
  strp <- function(x) substr(x,1,15)
  # Vector of per-gene p-values for the screening stage
  pScreen <- qval
  names(pScreen) <- strp(names(pScreen))
  # One column matrix of the confirmation p-values for confirmation stage
  pConfirmation <- matrix(dxr$pvalue,ncol=1)
  dimnames(pConfirmation) <- list(strp(dxr$featureID),"transcript")
  # two column dataframe `tx2gene`with the transcript and gene identifiers
  tx2gene <-  data.frame(dxr.t[,c("featureID", "groupID")], dxr.t[,c("featureID", "groupID")])
  # OFDRm, alpha parameter equal to 0.05
  for (i in 1:2) tx2gene[,i] = strp(tx2gene[,i])
  stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
                        pScreenAdjusted=TRUE, tx2gene=tx2gene[1:2])
  stageRObj = stageRTx(pScreen = pScreen, 
  pConfirmation = pConfirmation, 
  pScreenAdjusted = TRUE, 
  tx2gene = tx2gene[1:2])
  
  stageRObj <-  stageWiseAdjustment(stageRObj, method = "dtu", alpha = 0.05)
  
  suppressWarnings({
    dex.padj <- getAdjustedPValues(stageRObj, order=FALSE,
                                   onlySignificantGenes=TRUE)
  })
  
  dex.padj = merge(tx2gene, dex.padj, by.x = c("groupID","featureID"), by.y = c("geneID","txID"))
  # print(head(dex.padj[order(dex.padj$gene, decreasing = T), ]))
  
  # Normalize counts
  dex.norm = cbind(as.data.frame(stringr::str_split_fixed(rownames(counts(dxd)), ":", 2)), as.data.frame(counts(dxd, normalized = TRUE))[,1:20])
  colnames(dex.norm) = c("groupID", "featureID", as.character(colData(dxd)$sample_id)[1:20])
  row.names(dex.norm) = NULL

  # Per-group normalised mean
  dex.mean = as.data.frame(sapply( levels(samps$day), 
  function(lvl) rowMeans(dex.norm[, 3:ncol(dex.norm)][, samps$day == lvl, drop = FALSE]) ))
  colnames(dex.mean) <- paste0("mean.day", colnames(dex.mean))
  
  # log2 fold change in expression
  dex.log2fc = log2(dex.mean[2]/dex.mean[1])
  colnames(dex.log2fc) = "log2fc"
  rownames(dex.log2fc) = dex.norm$featureID
  
  # Merge to create result data
  dexData = cbind(dex.norm[,1:2], dex.mean, dex.norm[, 3:ncol(dex.norm)])
  colnames(dexData)[1:2] <-  c("GeneID","TranscriptID")
  dexData = dexData[order(dexData$GeneID, dexData$TranscriptID),]
  
  # Merge to create result data
  dexDTU = merge(dex.padj[,c("featureID.1","groupID.1","gene","transcript")], dex.log2fc, by.x = "featureID.1", by.y = "row.names")
  colnames(dexDTU)[1:2] <-  c("TranscriptID","GeneID")
  dexDTU = dexDTU[order(dexDTU$GeneID, dexDTU$TranscriptID),]

  # Plot the normalised counts for one of the significant genes, where we can see evidence of switching
  gene_id <-  unique(dex.padj[order(dex.padj$transcript, dex.padj$gene),]$groupID.1)[1]
  print(paste0("Significant gene chosen for visualization: ", gene_id))
  
  print(plotExpression(dex.norm, gene_id, samps, isProportion = T))
  
  return(dexDTU)
}

Analysis for NDR 0.2

[1] "Number of identified genes showing evidence for DTU through DEXSeq: 1578"
[1] "Number of transcripts involved through DEXSeq: 2569"
[1] "Significant gene chosen for visualization: ENSG00000010671.17"

[1] "Number of identified genes showing evidence for DTU through DEXSeq: 3530"
[1] "Number of transcripts involved through DEXSeq: 7799"
[1] "Significant gene chosen for visualization: ENSG00000004864.14"

After DRIMSeq filtering, 10364 (16.47%) of genes and 28447 (11.42%) of features remained in the Illumina+Salmon data for further analysis. For PacBio+Bambu 7847 (12.47%) genes and (7.95) features passed the filter.

For Illumina+Salmon, out of the 10364 tested genes and 28447 tested features, DEXSeq test identified 1578 (15.23%) genes showing evidence of isoform switching involving 2569 (9.03%) transcripts. For PacBio+Bambu out of the 7847 tested genes and 19816 tested features, 3530 genes (44.99%) involving 7799 (39.36%) transcripts were identified.

After the stageR procedure, 2528 transcripts (8.89% of the filtered) passed the confirmation stage for Illumina+Salmon and 7108 transcripts (35.87% of the filtered) for PacBio+Bambu on a target 5% overall false discovery rate (OFDR). This means that, in expectation, no more than 5% of the genes that pass screening will either (1) not contain any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.

The genes were ordered according to their adjusted p-values for transcript and genes after the stageR procedure. The first significant gene was chosen for visualization. For Illumina+Salmon data we can see that the isoform ENST00000308731.8 from gene ENSG00000010671.17 (Bruton tyrosine kinase, BTK) starts getting transcribed by the end of the treatment starting from day 3 and increases significantly. Meanwhile, the transcription of the isoform ENST00000621635.4 stays approximately the same. For PacBio+Bambu we can see that the counts for isoform ENST00000265631.10 from gene ENSG00000004864.14 (solute carrier family 25 member 13, SLC25A13) drop to 0 on day 5, meanwhile, they increase for isoform ENST00000416240.6.

UpSet Plot

From the UpSet plot and Venn Diagram above it is visible that through PacBio+Bambu many more transcripts were detected than through Illumina+Salmon. The overlap between PacBio+Bambu and Illumina+Salmon was also only roughly 7%. This percentage is less than for the thresholds for NDR 0.025, 0.05 and 0.1 which showed the overlap of roughly 11%.

Comparsion of pAdj and logFC

The comparison of the adjusted p-values for transcripts shows low correlation between technologies especially for higher values. The logFC values seem to be, however, better correlated with some deviations for the extreme values.

Discussion

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

With the exploratory analysis of the data the goal was to get a first overview of both data sets as well as the differences in the raw data between Illumina+Salmon and PacBio+Bambu for all 4 different NDR thresholds and correlations between the two methods stratified by length.

The results for all 4 thresholds were similar, thus some plots were omitted in the report and can be viewed at code/qmd/comparison_pacbio_illumina_tpm.qmd. The small differences can be explained by the fact that the NDR thresholds determine the number of newly detected genes which doesn’t have an effect on the comparison of both methods. Secondly, the analysis was performed on count and TPM data. It was decided to keep in this report only the result for TPMs due to stratification by transcript length making these results more interesting and comparable, and also because the further analysis steps were performed on TPM rather than the raw count data. For completeness the count results can be found in a separate file, code/qmd/comparison_pacbio_illumina_counts.qmd.

Differences

Further, the differences between the Illumina data (short read data produced by a thoroughly optimized and long-in-use method) and the long read PacBio data (has not been optimized as much and potentially still contains biases) were investigated. In the raw data, the differences seemed to be quite small (the majority being <50) as well as the highest bar of the histograms being around 0. This was shown here only for the first column as all the other columns looked very similar. This would suggest that the Illumina and PacBio raw data seem to be quite similar with only minor deviations.

Correlation and Stratification by Length

Next, the correlation between Illumina and PacBio data was investigated. Both Pearson and Spearman correlations were computed, which differed highly. For Pearson correlation, one cannot say that each sample was highest correlated with itself as some samples were highly correlated with all samples of all days (usually sample day 1 or 2), whilst the Spearman heat maps looked as expected. This could be explained by the fact that Spearman correlation is more robust to outliers than Pearson, and does not assume a linear relationship which works better with our data.

Furthermore, the data was stratified into intervals of transcript length 1000 bp which resulted in the bin 1000-2000 bp containing the most transcripts and making them an irregular size, however, this evened out as both Bambu and Salmon had those bin sizes and in the scatter plots the correlations were summed up. Looking at the heat maps not stratified by length, they look very similar for all NDR thresholds, but the correlations are smaller on the gene level which could possibly be explained by the fact that the transcripts were summed up for the genes, thus resulting in fewer data and a smaller correlation. The heat maps for the length stratification again looked similar between the 4 NDR thresholds and showed as expected the highest correlation with their own sample between Salmon and Bambu as well as their replicates. This means that the individual samples between Bambu and Salmon do correlate well and are the most similar to their respective sample in the other method, thus both methods seem to produce comparable results although the correlation is not highly pronounced. This may be due to noise and outliers which even the Spearman correlation isn’t robust to, or just due to different results produced by the two methods.

Lastly, looking at the scatter plot which again is very similar for all 4 NDR thresholds, there is a clear trend of a better summed up correlation towards longer lengths but a very low correlation for shorter transcript lengths (<=700 bp). This might be caused by the dropout in the PacBio data as it does not detect so short transcript lengths. The trend towards longer transcript lengths could be due to the assignment of PacBio reads to transcripts getting more accurate with longer transcripts. This means that the PacBio and Illumina data are getting more correlated with longer transcript lengths and are less comparable with respect to shorter transcript lengths.

MDS Plots

The MDS plots look again very similar for all 4 NDR thresholds and are comparable between Illumina and PacBio. MDS1 separates the samples in both methods by days and MDS2 by replicates. For Salmon the replicates are more spread out than for Bambu, but the overall trend is similar with the differences potentially being explained by the differences of the two methods. This again shows that the raw (in this case count data) is comparable between the two methods.

Differential Gene Expression and Gene Set Analyses

Differential Gene Expression Analysis

The aim of the DGE analysis was to identify genes that are DE over the course of time, and to compare these results of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data) for 4 different NDR thresholds.

When comparing the different NDR thresholds, only minor differences could be detected. This was expected since the main difference between the data sets of the different thresholds were the number of newly detected genes. The more interesting part was the comparison between the short read data and the long read data. Here, the long read data was produced using a new protocol, the MAS-ISO-seq protocol, which displays an optimization for PacBio sequencing. To evaluate this, the results of the PacBio data was compared to the Illumina data results. It could be shown that for both cases roughly the same amount of genes were identified as DE. However, when overlaying these genes, there were still many genes only called DE in one of the two cases. Moreover, when comparing the adjusted p-values and logFC values of the likelihood ratio tests, deviations between the two cases were visible. Nevertheless, a common trend could be detected and the majority of DE genes tend to be called by both methods, this shows that the analysis of the PacBio data already comes close to the truth, when treating Illumina as the truth. Due to the lack of PacBio data from older protocols, it is not possible to draw any conclusions regarding the improvement of this protocol compared to others.

Grouping of Differentially Expressed Genes

The grouping of DE genes was only an intermediate step between the DGE analysis and the gene set analysis. Nonetheless, it showed some interesting results and could possible influence the outcome of the gene set analysis. The number of clusters/sets the genes were grouped in (here 4) was chosen more or less arbitrarily, mostly guided by visual inspection. Since the number of clusters was rather small, and the number of genes that were grouped was rather large, most of the clusters were rather large as well. Since the clusters were used as input gene sets for the gene set analysis, this choice possibly affected the outcome of the gene set analysis and will thus be discussed again in the context of the gene set analysis.

Gene Set Analysis

The gene set analysis was performed with the aim of making sense of the genes that the DGE analysis identified as DE. To do so, the detected DE genes were grouped into 4 gene sets, which were then used as input for the gene set analysis performed with the web tool DAVID.

For the gene set 1 and 3, around 30 to 50 gene ontology terms were found to be significantly enriched in the gene sets of about 1200 - 1500 genes. The terms found significant for these two gene sets were rather homogeneous. The same could be observed for gene set 4. Notably, the amount of terms that were detected to be significantly enriched in gene set 4 was much higher (130 terms) than in gene sets 1 and 3, even though gene set 4 was only around half the size of the other two (there are 680 genes in gene set 4). Assuming that more genes in a gene set usually lead to more terms to be found significant, this could mean that gene set 4 is less homogeneous than gene sets 1 and 3. In agreement with that, the average expression of genes in gene set 4 over time also shows more heterogeneity than observed in the other clusters. The largest gene set, gene set 2, shows the highest amount of significant terms (250 while containing 2158 genes). These results seem to be more heterogeneous than the results of the other 3 gene sets, which could be caused by the larger size of gene set 2 and/or by more heterogeneity between the genes of this gene set. To overcome this issue and get clearer results also for gene set 2, the number of clusters during the grouping of differentially expressed genes could have been chosen larger to create smaller and therefore possible more homogeneous gene sets. However, when cutting the tree with the cutree function, the largest cluster, so gene set 2, gets only notable smaller when setting the number of cluster to 13. At this point, many small clusters with less than 50 genes are created. Therefore, the number of clusters was left to be 4.

In general, the terms that were found to be significant by the gene set analysis for all gene sets cover a lot of different functions and pathways in the cell. This could be explained by the experimental conditions, and by the cell line that was used. The cell line, namely WTC-11 cells, are induced pluripotent stem cells and therefore posses embryonic stem cell-like properties and are of pluripotent nature. These cells were treated with differentiation factors, meaning that these factors drove the process of cellular differentiation, so the differentiation of a less specialized cell into a more specialized one. Thus, many different cellular function and pathways could be affected.

Differential Transcript Usage Analysis

The aim of the differential transcript usage analysis was to identify alternative splicing and isoform switches over the course of time, and to compare these results of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data) for 4 different NDR thresholds.

When comparing the different NDR thresholds, only minor differences could be detected. Overall, the number of screened genes and confirmed transcripts increased with the threshold. This was expected as the number of detected genes also increased with increasing NDR threshold. The shorter reads from Illumina+Salmon data fail to span successive splice sites, which impairs the identification of isoform switching (Al’Khafaji et al, 2023), so that only roughly 10% of the filtered features passed the stageR confirmation stage from Illumina+Salmon data. Meanwhile, longer reads from PacBio+Bambu span the majority of the human transcripts roughly 35% of the filtered features passed this stage for PacBio+Bambu (Al’Khafaji et al, 2023). Overall, the detected number of features were much higher for PacBio+Bambu and overlap between transcripts showing isoform switching was only roughly 10%. PacBio seems to still be more sensitive for isoform detection due to the better coverage of the splice-sites by longer reads.

Conclusion

In conclusion, the exploratory data analysis provided an overview of the differences in the raw data between Illumina and PacBio, and showed that the raw data, even though displaying some differences, seems to be quite comparable. This is the case especially when looking at the raw data differences, but also for the correlation, which still shows some difficulties in shorter transcript lengths due to the incompatibility of PacBio to detect very short transcripts. Further, the DGE analysis demonstrated a comparable identification of DE genes between Illumina+Salmon and PacBio+Bambu data, with some deviations. The subsequent gene set analysis revealed diverse functional enrichment in gene sets, influenced by the experimental conditions. Lastly, the DTU analysis highlighted that the number of detected features was much higher for PacBio+Bambu than for Illumina+Salmon as the longer read length of PacBio reads spans the majority of human transcripts. All in all, a thorough comparison of the two methods could be performed whilst highlighting different aspects of both methods. Thereby, it could be shown that, even though overall being quite similar, there are still major differences between the quantification of PacBio data and Illumina data, especially in the context of DTU analysis.

Literature

Al’Khafaji AM, Smith JT, Garimella KV, Babadi M, Popic V, Sade-Feldman M, Gatzen M, Sarkizova S, Schwartz MA, Blaum EM et al (2023) High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat Biotechnol

Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22: 2008-2017

Baid G, Cook DE, Shafin K, Yun T, Llinares-López F, Berthet Q, Belyaeva A, Töpfer A, Wenger AM, Rowell WJ et al (2023) DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41: 232-238

Baralle FE, Giudice J (2017) Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol 18: 437-451

Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc B 57: 289-300

Chen Y, Sim A, Wan YK, Yeo K, Lee JJX, Ling MH, Love MI, Göke J (2023) Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods 20: 1187-1195

Conway JR, Lex A, Gehlenborg N (2017) UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33: 2938-2940

Hardwick SA, Joglekar A, Flicek P, Frankish A, Tilgner HU (2019) Getting the Entire Message: Progress in Isoform Sequencing. Front Genet 10

Lin H, 2021. Guide to RNA-seq Data Analysis. URL https://ycl6.gitbook.io/guide-to-rna-seq-analysis/differential-expression-analysis/differential-transcript-usage/dtu-using-dexseq.

Love M, Soneson C, Patro R (2018) Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification [version 3; peer review: 3 approved]. F1000Res 7

Marques-Coelho D, Iohan LdCC, Melo de Farias AR, Flaig A, Letournel F, Martin-Négrier M-L, Chapon F, Faisant M, Godfraind C, Maurage C-A et al (2021) Differential transcript usage unravels gene expression alterations in Alzheimer’s disease human brains. npj Aging Mech Dis 7: 2

Nowicka M, Robinson MD (2016) DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res 5: 1356

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14: 417-419

Posit team (2023). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.

Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140

Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W (2022) DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 50: W216-w221

Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND et al (2019) Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37: 1155-1162

Session Info and R Package Versions

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ggbeeswarm_0.7.2            stageR_1.22.0              
 [3] DEXSeq_1.46.1               RColorBrewer_1.1-3         
 [5] AnnotationDbi_1.62.2        DESeq2_1.40.2              
 [7] SummarizedExperiment_1.30.2 GenomicRanges_1.52.1       
 [9] GenomeInfoDb_1.36.4         IRanges_2.34.1             
[11] S4Vectors_0.38.2            MatrixGenerics_1.12.3      
[13] matrixStats_1.1.0           Biobase_2.60.0             
[15] BiocGenerics_0.46.0         BiocParallel_1.34.2        
[17] DRIMSeq_1.28.0              ggvenn_0.1.10              
[19] readxl_1.4.3                gplots_3.1.3               
[21] cowplot_1.1.1               patchwork_1.1.3            
[23] pheatmap_1.0.12             viridis_0.6.4              
[25] viridisLite_0.4.2           UpSetR_1.4.0               
[27] edgeR_3.42.4                limma_3.56.2               
[29] lubridate_1.9.3             forcats_1.0.0              
[31] stringr_1.5.1               dplyr_1.1.4                
[33] purrr_1.0.2                 readr_2.1.4                
[35] tidyr_1.3.0                 tibble_3.2.1               
[37] ggplot2_3.4.4               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0       jsonlite_1.8.7          magrittr_2.0.3         
 [4] farver_2.1.1            rmarkdown_2.25          zlibbioc_1.46.0        
 [7] vctrs_0.6.4             memoise_2.0.1           Rsamtools_2.16.0       
[10] RCurl_1.98-1.13         htmltools_0.5.7         S4Arrays_1.0.6         
[13] progress_1.2.2          curl_5.1.0              cellranger_1.1.0       
[16] KernSmooth_2.23-22      htmlwidgets_1.6.3       plyr_1.8.9             
[19] cachem_1.0.8            lifecycle_1.0.4         pkgconfig_2.0.3        
[22] Matrix_1.6-4            R6_2.5.1                fastmap_1.1.1          
[25] GenomeInfoDbData_1.2.10 digest_0.6.33           colorspace_2.1-0       
[28] geneplotter_1.78.0      RSQLite_2.3.3           hwriter_1.3.2.1        
[31] filelock_1.0.2          labeling_0.4.3          fansi_1.0.5            
[34] timechange_0.2.0        httr_1.4.7              abind_1.4-5            
[37] mgcv_1.9-0              compiler_4.3.1          bit64_4.0.5            
[40] withr_2.5.2             DBI_1.1.3               biomaRt_2.56.1         
[43] rappdirs_0.3.3          DelayedArray_0.26.7     gtools_3.9.5           
[46] caTools_1.18.2          tools_4.3.1             vipor_0.4.5            
[49] beeswarm_0.4.0          glue_1.6.2              nlme_3.1-164           
[52] reshape2_1.4.4          generics_0.1.3          gtable_0.3.4           
[55] tzdb_0.4.0              hms_1.1.3               xml2_1.3.5             
[58] utf8_1.2.4              XVector_0.40.0          pillar_1.9.0           
[61] genefilter_1.82.1       splines_4.3.1           BiocFileCache_2.8.0    
[64] lattice_0.22-5          survival_3.5-7          bit_4.0.5              
[67] annotate_1.78.0         tidyselect_1.2.0        locfit_1.5-9.8         
[70] Biostrings_2.68.1       knitr_1.45              gridExtra_2.3          
[73] xfun_0.41               statmod_1.5.0           stringi_1.8.2          
[76] yaml_2.3.7              evaluate_0.23           codetools_0.2-19       
[79] cli_3.6.1               xtable_1.8-4            munsell_0.5.0          
[82] Rcpp_1.0.11             dbplyr_2.3.4            png_0.1-8              
[85] XML_3.99-0.16           parallel_4.3.1          blob_1.2.4             
[88] prettyunits_1.2.0       bitops_1.0-7            scales_1.3.0           
[91] crayon_1.5.2            rlang_1.1.2             KEGGREST_1.40.1